- .TI Keeping watch over the flocks
- .TI by night (and day)
- .AU 7
- Kenneth Ingham
- University of New Mexico Computing Center
- Distributed Systems Group
- 2701 Campus NE
- Albuquerque, NM 87131
- (505) 277-8044
- ingham@charon.unm.edu or ucbvax!unmvax!charon!ingham
- .AB
- Over the last several years, the number of machines maintained by the
- University of New Mexico Computing Center has increased rapidly, yet
- the number of system managers monitoring these systems has remained
- static. Consequently, the system managers were faced with the task of
- watching more and more machines; since only one system manager (known
- affectionately as "DOC") is on call at any time, this soon proved to
- be an unacceptable situation. Shell scripts running every six hours
- gave some assistance; this was offset by the fact that the scripts
- generated a great deal of output indicating normal system operation,
- which the system manager still had to scan carefully for signs of
- trouble. This paper describes \fIwatcher\fR, a flexible system monitor
- which watches the system more closely than the human system manager
- while generating less output for him to examine.
- .sp
- Running more often than the above-mentioned set of shell scripts,
- \fIwatcher\fR is able to keep closer tabs on the system; since it
- delivers only a list of potential problems, however, this extra
- monitoring produces \fIno\fR corresponding increase in the demand on
- DOC. No problems slip by unnoticed in the more concise output,
- leading to an improvement in overall system availability as well as the
- more effective utilization of the system manager's time.
- .BD
- .SE 0. Acknowledgments (I couldn't have done it without you)
- I would like to thank Leslie Gorsline for her assistance in the writing
- of this paper. Without her, this paper might not have been. Also
- thanks to the UNMCC distributed systems group for their comments that helped
- improve \fIwatcher\fR.
- .SE 1. Background (the problem)
- The computing facilities offered by the University of New Mexico
- Computing Center (UNMCC) include three microvaxen, five large vaxen
- (780 or bigger), and a Sequent B8000. In addition to these Unix/VMS
- machines, the UNMCC Distributed Systems Group (DSG) monitors a number
- of the various microvaxen and Sun workstations scattered across
- campus. This duty falls to the DSG Programmer designated as "DOC", or
- "DSG On Call", who receives his beeper based on a monthly rotation
- schedule.
- .sp
- In the past, shell scripts running every six hours reported various
- system statistics to DOC, who then scanned the output for signs of
- possible trouble. The output of these shell scripts became
- overwhelming as the number of machines and potential problems grew;
- corresponding to this increase in output was an increase in the amount
- of time that DOC had to spend reading this output. In addition, most
- of this output merely indicated normal system operation; potential
- problems were buried amongst non-problems. Because of this, DOC could
- often waste a tremendous amount of time wading through system status
- reports, time which could be better spent actually fixing system
- problems.
- .sp
- Unix is equipped with many powerful tools for program development, but
- none which simply watch the system for signs of trouble. Programs like
- \fIps\fR and \fIdf\fR provide information regarding the current state
- of the machine, yet it still remains DOC's responsibility to interpret
- this information and assess the health of the system at any given
- time. This deficiency can be rectified by providing the
- system with the capacity to determine its own state of health, advising
- DOC when it notices a problem which requires DOC's intervention.
- .SE 2. Design Goals (devising the solution)
- In designing \fIwatcher\fR, the author closely examined just what DOC
- does in monitoring the system; just how \fIdoes\fR DOC spot potential
- trouble in the DOC reports? These reports consist of output from \fIdf
- -i\fR, \fIruptime\fR, \fIps -aux | sort\fR, and the tail of
- \fIcronlog\fR, which usually only changes in the middle of the night.
- It was determined that DOC's task consisted primarily of scanning
- various numbers in this output, deciding whether or not they had
- exceeded an allowable maximum or minimum, or if the values had changed
- too much from the last time the command was run, assuming the last
- value is even remembered. Getting a computer to do this is more
- complicated than might seem at first glance, due to inconsistencies in
- the location of pertinent information between runs of these commands.
- For instance, the process occupying the fifth line of \fIps -ax\fR
- might next time appear on the eighth line; similarly, \fIuptime\fR does
- not consistently put germane information in the same place on the line.
- .sp
- While flexibility is certainly a primary design consideration, it is
- not the whole story. In order to improve DOC's effectiveness, the
- program should run frequently, roughly every two or three hours,
- catching problems early (hopefully before they have affected
- the users). Thus, the program should also be as silent as possible
- except when it detects a potential problem; any advantage DOC gains in
- using \fIwatcher\fR would be eliminated if the program delivered an
- exceedingly verbose status report every two hours. \fIwatcher\fR's
- problem reports should be exact and concise, leading DOC immediately to
- the trouble.
- .sp
- The problem of reducing the amount of output DOC must process can be
- approached in different ways, including the redesign of the current
- shell scripts. A simple \fIawk\fR script can watch the output from
- \fIdf\fR [1]. However, each command would require a custom-tailored
- \fIawk\fR script to look at it. This task grows more complicated as the
- number of programs running increases.
- While a program could be written to
- generate these \fIawk\fR scripts, this process is needlessly complex;
- for only a bit more work, an efficient C program such as
- \fIwatcher\fR can be developed.
- .SE 3. Design (actual implementation of the solution)
- Run at intervals specified in \fIcrontab\fR, \fIwatcher\fR parses a
- control file (./\fIwatcherfile\fR by default)
- with a \fIyacc\fR-generated parser, building a data structure
- containing all of the information from the file. The file contains the
- list of commands \fIwatcher\fR
- should run (the pipeline), output specifications
- for each command (the output format), and the guidelines used in
- determining if something is amiss and should be reported to DOC (the
- change format). A sample \fIwatcher\fR control file would look
- something like this (comment lines begin with a '#'):
- .EX
- # Here is the pipeline and its alias:
- (df -i | /usr/ucb/tail +2) { df }
- # the output format; this is a column output format:
- $1-9 device%k $41-42 spaceused%d $64-65 inodesused%d:
- # and the change format:
- spaceused 15%;
- spaceused 0 89;
- inodesused 15%;
- inodesused 0 49.
-
- # another command example:
- (/usr/ucb/ruptime | fgrep -f UnmHosts) { ruptime }
- # this is a relative output format
- 2 status%s 1 machine%k 7 loadav%d:
- # and another change format:
- loadav 0 10;
- status "up".
- .NX
- The first entry causes \fIwatcher\fR to run the \fIdf\fR pipeline
- listed in parentheses. When reporting problems, \fIwatcher\fR refers
- to this command by the alias provided in the braces; if no alias
- appears, \fIwatcher\fR uses the entire pipeline.
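- .sp
- The run-and-label step described above can be sketched in a few lines of
- Python. This is only an illustration, not \fIwatcher\fR's C source: the
- function name \fBrun_pipeline\fR and its return value are assumptions made
- for the example, and the pipeline is simply handed to the shell.

```python
import subprocess
from typing import List, Optional, Tuple


def run_pipeline(pipeline: str, alias: Optional[str] = None) -> Tuple[str, List[str]]:
    """Run a shell pipeline and return (label, output lines).

    The label is the alias when one is given, otherwise the whole
    pipeline text, mirroring the rule described in the paper.
    """
    result = subprocess.run(
        pipeline,
        shell=True,            # the control file holds a full shell pipeline
        capture_output=True,
        text=True,
        check=False,           # a failing command is itself worth reporting
    )
    label = alias if alias else pipeline
    return label, result.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in pipeline; on a real system this would be the df pipeline.
    label, lines = run_pipeline("printf 'header\\nrow1\\nrow2\\n' | tail -n 2", "demo")
    print(label, lines)
```

- Returning the label together with the output keeps later problem reports
- short, since they can name the alias instead of the full command line.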
- .sp
- The output format
- instructs \fIwatcher\fR how to parse the output;
- column format, indicated in the output format by \fBnum-num\fR,
- instructs \fIwatcher\fR that the output should be parsed
- by columns, while relative format, denoted by a single integer, shows
- that the output should be broken up by whitespace.
- Through the convention \fBname%type\fR, the output format also names each
- field, indicating whether the field is numeric, string, or
- keyword, specified by \fBd\fR, \fBs\fR, or \fBk\fR respectively.
- Keyword fields are
- used to match up corresponding output lines between runs. Thus
- .EX
- 41-42 spaceused%d
- .NX
- indicates that this field, named \fBspaceused\fR, contains numeric
- information in columns 41-42, while
- .EX
- 2 status%s
- .NX
- informs \fIwatcher\fR that the second word (group of non-whitespace
- characters) on the line is a string field named \fBstatus\fR.
- For the \fIdf\fR example given above,
- .EX
- Filesystem    kbytes    used   avail capacity iused   ifree  %iused  Mounted on
- /dev/hp1f      52431   39763    7424    84%    6937    9447    42%   /develop
- .NX
- \fBdevice\fR would be \fI/dev/hp1f\fR, \fBspaceused\fR would be 84,
- and \fBinodesused\fR would be 42. Similarly, the output from the
- \fIruptime\fR example, which looks like this
- .EX
- charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66
- .NX
- would be broken at the following places:
- .EX
- charon | up | 26+07:53, | 17 | users, | load | 3.12, | 2.90, | 2.66
- .NX
- assigning "charon" to \fBmachine\fR, "up" to \fBstatus\fR, and 3.12 to \fBloadav\fR.
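- .sp
- The two output formats can be illustrated with a short Python sketch. It
- is not the \fIyacc\fR grammar \fIwatcher\fR uses; the names \fBFieldSpec\fR,
- \fBparse_output_format\fR, and \fBextract_fields\fR are invented for the
- example. Three behaviours are assumptions: a leading '$' on a column range
- is accepted and ignored, numeric fields keep only the leading number of a
- token (so "3.12," yields 3.12), and the \fIdf\fR line is laid out with column
- widths chosen so that the numbers fall in columns 41-42 and 64-65.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Union


class FieldSpec(NamedTuple):
    name: str      # field name, e.g. "spaceused"
    kind: str      # "d" numeric, "s" string, "k" keyword
    start: int     # first column (column format) or word number (relative format)
    end: int = 0   # last column; 0 marks a relative-format field


# First number found in a token such as "3.12," or "84" (assumed behaviour).
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_output_format(text: str) -> List[FieldSpec]:
    """Turn '2 status%s 1 machine%k 7 loadav%d:' into a list of FieldSpecs."""
    tokens = text.rstrip(":").split()
    specs = []
    for where, what in zip(tokens[0::2], tokens[1::2]):
        name, kind = what.split("%")
        where = where.lstrip("$")          # assumption: '$' is optional decoration
        if "-" in where:                   # column format: num-num
            start, end = (int(n) for n in where.split("-"))
            specs.append(FieldSpec(name, kind, start, end))
        else:                              # relative format: a single word number
            specs.append(FieldSpec(name, kind, int(where)))
    return specs


def extract_fields(line: str, specs: List[FieldSpec]) -> Dict[str, Union[float, str, None]]:
    """Pull the named fields out of one line of command output."""
    words = line.split()
    values: Dict[str, Union[float, str, None]] = {}
    for spec in specs:
        if spec.end:                       # columns are 1-based and inclusive
            raw = line[spec.start - 1:spec.end]
        else:                              # words are 1-based
            raw = words[spec.start - 1] if spec.start <= len(words) else ""
        if spec.kind == "d":
            match = _NUMBER.search(raw)
            values[spec.name] = float(match.group()) if match else None
        else:
            values[spec.name] = raw.strip()
    return values


# The df row from the text, rebuilt with fixed widths (an assumed layout).
DF_LINE = "%-12s%8d%8d%8d%6.0f%%%8d%8d%6.0f%%   %s" % (
    "/dev/hp1f", 52431, 39763, 7424, 84, 6937, 9447, 42, "/develop")
RUPTIME_LINE = "charon up 26+07:53, 17 users, load 3.12, 2.90, 2.66"

if __name__ == "__main__":
    df_specs = parse_output_format("$1-9 device%k $41-42 spaceused%d $64-65 inodesused%d:")
    print(extract_fields(DF_LINE, df_specs))
    ruptime_specs = parse_output_format("2 status%s 1 machine%k 7 loadav%d:")
    print(extract_fields(RUPTIME_LINE, ruptime_specs))
```

- Column format depends on the command printing fixed-width output, while
- relative format tolerates varying widths but breaks if a word is added or
- dropped earlier on the line.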
- .sp
- The name field also appears in the change format, designating allowable
- values for this field. These values can be specified as a single
- quoted string in the case of string fields; in the case of
- numeric fields, the values take the form of either
- percentage or absolute changes, or a minimum and maximum which delineate
- an acceptable range.
- Thus
- .EX
- inodesused 15%;
- inodesused 0 49.
- .NX
- signifies that DOC should be notified if the field named \fBinodesused\fR
- increases by more than 15% from the last run, or if it is outside the
- range 0 to 49; similarly
- .EX
- status "up".
- .NX
- tells \fIwatcher\fR to notify DOC if the \fBstatus\fR field contains
- anything other than the word "up".
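- .sp
- The checks implied by a change format can be sketched as three small
- Python functions. The function names and message wording are invented
- for the example. It also assumes that a percentage change is measured
- relative to the previous value and counts in either direction; the paper
- does not state how \fIwatcher\fR computes it.

```python
from typing import Optional


def check_range(name: str, current: float, low: float, high: float) -> Optional[str]:
    """Return a complaint if current lies outside [low, high], else None."""
    if low <= current <= high:
        return None
    return f"{name} = {current:.2f}; valid range {low:.2f} to {high:.2f}"


def check_percent(name: str, previous: Optional[float], current: float,
                  limit: float) -> Optional[str]:
    """Return a complaint if the value moved by more than limit percent.

    With no previous value (a new command) or a previous value of zero,
    no percentage can be computed, so nothing is reported (assumed policy).
    """
    if previous is None or previous == 0:
        return None
    change = abs(current - previous) / abs(previous) * 100.0
    if change <= limit:
        return None
    return (f"{name} changed by more than {limit:g}%; "
            f"previous value {previous:.2f}; current value {current:.2f}")


def check_string(name: str, current: str, expected: str) -> Optional[str]:
    """Return a complaint if a string field differs from the allowed value."""
    if current == expected:
        return None
    return f'{name} is "{current}"; expected "{expected}"'


if __name__ == "__main__":
    findings = [
        check_range("spaceused", 91, 0, 89),
        check_percent("inodesused", 20, 26, 15),
        check_string("status", "up", "up"),
    ]
    for finding in findings:
        if finding:
            print(finding)
```

- Each check returns either a message or nothing, so a caller can collect
- every complaint about a line and stay silent when the list is empty.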
- .sp
- As \fIwatcher\fR parses the output of a pipeline, it stores the
- pertinent parts of the output in a history file (by
- default, ./\fIwatcher.history\fR).
- The next time \fIwatcher\fR runs, it reads this file to provide
- comparison values for the command. If a command is new (i.e., it has no
- previously-stored output in the history file), \fIwatcher\fR checks the
- fields which require no previous data, such as min-max fields, while
- still storing \fIall\fR of the relevant information to the history file.
- Thus, the next time the new command is run, it will be an \fIold\fR command,
- and meaningful between-run comparisons can be made.
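- .sp
- The history mechanism might look like the following Python sketch. The
- on-disk format of \fIwatcher.history\fR is not described in this paper, so
- JSON is used here purely as a stand-in, and the helper names are
- hypothetical. Values are stored under the command alias and the line's
- keyword field, so a line is matched with its predecessor even when it
- moves to a different position in the output.

```python
import json
import os
from typing import Dict, Optional

# alias -> keyword value -> {field name: value}
History = Dict[str, Dict[str, Dict[str, object]]]


def load_history(path: str) -> History:
    """Read the history file; a missing file means every command is new."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_history(path: str, history: History) -> None:
    """Write the history back out for the next run."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(history, handle, indent=2, sort_keys=True)


def previous_fields(history: History, alias: str, keyword: str) -> Optional[Dict[str, object]]:
    """Values stored for this line on the last run, or None if it is new."""
    return history.get(alias, {}).get(keyword)


def remember(history: History, alias: str, keyword: str, fields: Dict[str, object]) -> None:
    """Store all fields of a line, whether or not a comparison was possible."""
    history.setdefault(alias, {})[keyword] = dict(fields)


if __name__ == "__main__":
    path = "watcher.history"
    history = load_history(path)
    old = previous_fields(history, "df", "/dev/hp1f")
    print("previous run:", old)          # None on the very first run
    remember(history, "df", "/dev/hp1f", {"spaceused": 84.0, "inodesused": 42.0})
    save_history(path, history)
```

- On a first run \fBprevious_fields\fR returns nothing, so only checks that
- need no earlier value (such as min-max ranges) can be applied; the second
- run finds the stored values and can compare against them.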
- .sp
- When \fIwatcher\fR
- detects no problems with the system, DOC receives an empty mail message
- with the subject "\fIhostname\fR had no problems at \fIdate\fR";
- this is to ensure that \fImail\fR is running correctly.
- When it notices a problem which should be brought to DOC's attention,
- it mails the system problem report in a concise
- format, explaining what is wrong and why.
- Thus, rather than the megabytes of shell script output that DOC
- used to receive and have to read,
- he merely sees this when he reads his mail:
- .EX
- Mail version 5.2 6/21/85. Type ? for help.
- "/usr/spool/mail/ingham": 5 messages 5 new
- N 1 root@charon.unm Sat Apr 11 16:00 8/212 "charon had no problems at Sat"
- N 2 root@ariel.unm Sat Apr 11 16:00 8/208 "ariel had no problems at Sat "
- N 3 root@geinah.unm Sat Apr 11 16:00 11/417 "System problem report for gei"
- N 4 root@izar.unm Sat Apr 11 16:00 8/204 "izar had no problems at Sat A"
- N 5 root@deimos.unm Sat Apr 11 16:00 8/212 "deimos had no problems at Sat"
- .NX
- The letters indicating no problems can be immediately deleted, and DOC
- can turn his attention to the letter indicating a
- system problem. A sample problem report
- would look something like this:
- .EX
- df has a max/min value out of range:
- /dev/hp0h     140488  111195   15244    91%   10145   28767    26%   /usr
- where spaceused = 91.00; valid range 0.00 to 89.00.
- Also it had inodesused change by more than 10%.
- Previous value 20.00; current value 26.00.
- .NX
- Note that if a line has more than one indication of a problem, all
- anomalies are included in the report.
- This provides DOC with as much information as possible, allowing him
- to determine the problem quickly and devise
- a rapid fix (hopefully before users know something is amiss).
- .sp
- .SE 4. Results (how it's helped us)
- \fIwatcher\fR's primary advantage lies in the reduction of DOC's work
- load. It has taken over the more menial aspects of monitoring a system,
- tasks like reading and comparing numbers,
- giving DOC more time to concentrate on bugs of a nature which
- \fIwatcher\fR isn't set up to monitor, such as problems in the
- accounting system.
- DOC is apprised of potential problems quickly, and in
- some cases can repair them in less time than simply
- reading the shell script output
- would have taken.
- .sp
- The ability to monitor changes between runs has also helped bring to our
- attention some
- problems which were missed in the DOC reports. For example,
- disk space on \fI/u2\fR on one of our machines jumped by more than 15%. Since
- this jump did not force the total space used above 90%, at which point
- DOC would have investigated the filesystem, it is unlikely
- that DOC would have even noticed this sudden change. The facility to
- watch for relative changes between runs enables DOC to catch problems in
- their infancy, and fix problems such as filesystems filling up too
- rapidly before they inconvenience the users.
- .sp
- Since the system manager specifies not only the commands \fIwatcher\fR will
- execute and the time lapse between successive runs, but also the
- parameters which indicate system anomalies,
- \fIwatcher\fR can easily be seen as a very flexible, general system
- monitor. Its use at UNM has provided an increase in the
- productivity of the system manager, which has led in turn to an
- increase in the reliability and availability of the systems at UNMCC.
- .SE 5. Availability (how to get one)
- \fIwatcher\fR will be sent to the moderator of mod.sources after the
- conference is over.
- .SE 6. References (you might also find this interesting)
- .in +0.5i
- .ti -0.5i
- [1] Monitoring Free Disk Space, Rik Farrow, Wizard's Grabbag, \fIUnix
- World\fR, Vol. IV, no. 3, pp. 86-87.
- .in -0.5i
-